Introduction

In this assignment, I will explore, analyze, and model the Moneyball dataset. Each record represents a professional baseball team from the years 1871 to 2006, inclusive, and contains the team's performance for the given year, with all statistics adjusted to match a 162-game season. The purpose of this assignment is to build a multiple linear regression model on the training data to predict the number of wins for each team.

Descriptive Analysis

Variables:

INDEX Identification Variable (do not use)

TARGET_WINS Number of wins

TEAM_BATTING_H Base Hits by batters (1B,2B,3B,HR)

TEAM_BATTING_2B Doubles by batters (2B)

TEAM_BATTING_3B Triples by batters (3B)

TEAM_BATTING_HR Homeruns by batters (4B)

TEAM_BATTING_BB Walks by batters

TEAM_BATTING_HBP Batters hit by pitch (get a free base)

TEAM_BATTING_SO Strikeouts by batters

TEAM_BASERUN_SB Stolen bases

TEAM_BASERUN_CS Caught stealing

TEAM_FIELDING_E Errors

TEAM_FIELDING_DP Double Plays

TEAM_PITCHING_BB Walks allowed

TEAM_PITCHING_H Hits allowed

TEAM_PITCHING_HR Homeruns allowed

TEAM_PITCHING_SO Strikeouts by pitchers

# Loading library
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(DataExplorer)
library(mice)
## 
## Attaching package: 'mice'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(corrplot)
## corrplot 0.92 loaded
library(reshape)
## 
## Attaching package: 'reshape'
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp
## 
## The following object is masked from 'package:dplyr':
## 
##     rename
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths
library(reshape2)
## 
## Attaching package: 'reshape2'
## 
## The following objects are masked from 'package:reshape':
## 
##     colsplit, melt, recast
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(dplyr)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# caret (loaded above) provides data splitting and pre-processing
library(stats) 

We have two data sets:

  • a training set, where most of the analysis will be done
  • an evaluation set, which will be used to evaluate the model

# Load the money ball data
# Evaluation set, used as the test set
money_ball_eval <- read.csv('moneyball-evaluation-data.csv')
# Training set
money_ball_train <- read.csv('moneyball-training-data.csv')
str(money_ball_train)
## 'data.frame':    2276 obs. of  17 variables:
##  $ INDEX           : int  1 2 3 4 5 6 7 8 11 12 ...
##  $ TARGET_WINS     : int  39 70 86 70 82 75 80 85 86 76 ...
##  $ TEAM_BATTING_H  : int  1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
##  $ TEAM_BATTING_2B : int  194 219 232 209 186 200 179 171 197 213 ...
##  $ TEAM_BATTING_3B : int  39 22 35 38 27 36 54 37 40 18 ...
##  $ TEAM_BATTING_HR : int  13 190 137 96 102 92 122 115 114 96 ...
##  $ TEAM_BATTING_BB : int  143 685 602 451 472 443 525 456 447 441 ...
##  $ TEAM_BATTING_SO : int  842 1075 917 922 920 973 1062 1027 922 827 ...
##  $ TEAM_BASERUN_SB : int  NA 37 46 43 49 107 80 40 69 72 ...
##  $ TEAM_BASERUN_CS : int  NA 28 27 30 39 59 54 36 27 34 ...
##  $ TEAM_BATTING_HBP: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TEAM_PITCHING_H : int  9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
##  $ TEAM_PITCHING_HR: int  84 191 137 97 102 92 122 116 114 96 ...
##  $ TEAM_PITCHING_BB: int  927 689 602 454 472 443 525 459 447 441 ...
##  $ TEAM_PITCHING_SO: int  5456 1082 917 928 920 973 1062 1033 922 827 ...
##  $ TEAM_FIELDING_E : int  1011 193 175 164 138 123 136 112 127 131 ...
##  $ TEAM_FIELDING_DP: int  NA 155 153 156 168 149 186 136 169 159 ...
introduce(money_ball_train)
##   rows columns discrete_columns continuous_columns all_missing_columns
## 1 2276      17                0                 17                   0
##   total_missing_values complete_rows total_observations memory_usage
## 1                 3478           191              38692       159280
# Data Description 
money_ball_train %>% 
  summary() %>%
  kable() %>% kable_styling() %>%  kable_classic(full_width = F, html_font = "Cambria")
Variable             Min.     1st Qu.   Median    Mean     3rd Qu.   Max.      NA's
INDEX                1.0      630.8     1270.5    1268.5   1915.5    2535.0    0
TARGET_WINS          0.00     71.00     82.00     80.79    92.00     146.00    0
TEAM_BATTING_H       891      1383      1454      1469     1537      2554      0
TEAM_BATTING_2B      69.0     208.0     238.0     241.2    273.0     458.0     0
TEAM_BATTING_3B      0.00     34.00     47.00     55.25    72.00     223.00    0
TEAM_BATTING_HR      0.00     42.00     102.00    99.61    147.00    264.00    0
TEAM_BATTING_BB      0.0      451.0     512.0     501.6    580.0     878.0     0
TEAM_BATTING_SO      0.0      548.0     750.0     735.6    930.0     1399.0    102
TEAM_BASERUN_SB      0.0      66.0      101.0     124.8    156.0     697.0     131
TEAM_BASERUN_CS      0.0      38.0      49.0      52.8     62.0      201.0     772
TEAM_BATTING_HBP     29.00    50.50     58.00     59.36    67.00     95.00     2085
TEAM_PITCHING_H      1137     1419      1518      1779     1682      30132     0
TEAM_PITCHING_HR     0.0      50.0      107.0     105.7    150.0     343.0     0
TEAM_PITCHING_BB     0.0      476.0     536.5     553.0    611.0     3645.0    0
TEAM_PITCHING_SO     0.0      615.0     813.5     817.7    968.0     19278.0   102
TEAM_FIELDING_E      65.0     127.0     159.0     246.5    249.2     1898.0    0
TEAM_FIELDING_DP     52.0     131.0     149.0     146.4    164.0     228.0     286

Data Preparation

We are going to check the distribution of every column and the outliers present in the data set. The chart below shows the percentage of missing values in each column. We notice that TEAM_BATTING_HBP has the highest number of missing values, but its mean, median, and maximum lie in roughly the same range, which suggests the column is centered rather than heavily skewed.

# Correlation Plot
cor_matrix <- cor(money_ball_train, use = 'complete.obs')
corrplot(cor_matrix, method = 'circle')

money <- money_ball_train
# Create missing data flags

# Create missing data flags
#missing_flag <- ifelse(is.na(money_ball_train$TEAM_BATTING_HBP), 1, 2)
#money_ball_train$missing_flag <- missing_flag


par(mfrow=c(3,3)) 
# Create distribution plot to check outliers 
for (i in 1:17) {
  hist(money[,i],main=names(money[i]),xlab=names(money[i]),breaks = 51)
  boxplot(money[,i], main=names(money[i]), type="l",horizontal = TRUE)
  
  plot(money[,i], money$TARGET_WINS, main = names(money[i]), xlab=names(money[i]))
  abline(lm(money$TARGET_WINS ~ money[,i], data = money), col = "blue")
}

The relationship between the target variable and the predictors is not particularly strong, but there are statistically significant relationships with certain variables. The calculate_correlations_with_pvalues function computes the correlation coefficient and p-value between a given target variable and every other predictor in a dataset. It first validates the input to ensure that the target column is a single character string and exists in the data, then drops rows with missing values. For each predictor, the function performs a correlation test using cor.test(), which returns both the correlation coefficient and the corresponding p-value. These results are stored in a data frame, with both values rounded to 10 decimal places. This gives a clear, organized output, making it easy to evaluate the strength and statistical significance of each predictor's relationship with the target.

calculate_correlations_with_pvalues <- function(data, target_col) {
  # Check if target_col is a character string
  if (!is.character(target_col) || length(target_col) != 1) {
    stop("target_col must be a single character string.")
  }
  
  # Ensure the target column exists in the data
  if (!target_col %in% names(data)) {
    stop("Target column not found in the dataframe.")
  }
  
  # Remove rows with missing values
  data_complete <- data[complete.cases(data), ]
  
  # Initialize a results data frame
  results <- data.frame(Predictor = character(), Correlation = numeric(), PValue = numeric(), stringsAsFactors = FALSE)
  
  # Loop through each predictor variable
  for (predictor in names(data_complete)[names(data_complete) != target_col]) {
    # Perform the correlation test
    test_result <- cor.test(data_complete[[predictor]], data_complete[[target_col]])
    
    # Store the rounded results to 10 decimal places
    results <- rbind(results, data.frame(Predictor = predictor, 
                                         Correlation = round(test_result$estimate, 10), 
                                         PValue = round(test_result$p.value, 10)))
  }
  
  return(results)
}


# Example usage
correlation_results <- calculate_correlations_with_pvalues(money_ball_train, "TARGET_WINS")

# View the results
print(correlation_results)
##              Predictor Correlation       PValue
## cor              INDEX -0.04895047 0.5012836479
## cor1    TEAM_BATTING_H  0.46994665 0.0000000000
## cor2   TEAM_BATTING_2B  0.31298400 0.0000104138
## cor3   TEAM_BATTING_3B -0.12434586 0.0865523694
## cor4   TEAM_BATTING_HR  0.42241683 0.0000000012
## cor5   TEAM_BATTING_BB  0.46868793 0.0000000000
## cor6   TEAM_BATTING_SO -0.22889273 0.0014476066
## cor7   TEAM_BASERUN_SB  0.01483639 0.8385814774
## cor8   TEAM_BASERUN_CS -0.17875598 0.0133534492
## cor9  TEAM_BATTING_HBP  0.07350424 0.3122327101
## cor10  TEAM_PITCHING_H  0.47123431 0.0000000000
## cor11 TEAM_PITCHING_HR  0.42246683 0.0000000011
## cor12 TEAM_PITCHING_BB  0.46839882 0.0000000000
## cor13 TEAM_PITCHING_SO -0.22936481 0.0014142043
## cor14  TEAM_FIELDING_E -0.38668800 0.0000000329
## cor15 TEAM_FIELDING_DP -0.19586601 0.0066174528

Now let's dive into the missing values.

# Plot overall missing information
plot_intro(money_ball_train, title = 'Missing Information on Money Ball Dataset',
           ggtheme = theme_minimal())

# Plot missing volume by column
plot_missing(money_ball_train, title = 'Information about Missing Values in the Money Ball Dataset', ggtheme = theme_minimal())

We have discovered that some observations are missing data; these columns will need imputation after analysis. About 9% of the data is missing overall, spread out as follows:

  • TEAM_PITCHING_SO 4.48%
  • TEAM_BATTING_SO 4.48%
  • TEAM_BASERUN_SB 5.76%
  • TEAM_BASERUN_CS 12.57%
  • TEAM_BATTING_HBP 91.61%
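Per-column missing percentages like the ones above can be computed directly with colMeans(is.na(.)). A minimal sketch on a small synthetic data frame (not the Moneyball data itself):

```r
# Hypothetical 4-row frame: column a has 1 NA, b has 2, c has none
df <- data.frame(a = c(1, NA, 3, 4),
                 b = c(NA, NA, 7, 8),
                 c = 1:4)

# Fraction of NAs per column, converted to percent and sorted
pct_missing <- sort(colMeans(is.na(df)) * 100, decreasing = TRUE)
pct_missing
#>  b  a  c
#> 50 25  0
```

The same call on money_ball_train reproduces the percentages quoted above.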

Why Use Predictive Imputation?

Preserves Relationships: Predictive imputation uses the relationships between variables to estimate missing values, which can lead to more accurate and reliable data sets.

Handles Different Types of Missing Data: It can handle data that is Missing Completely at Random (MCAR), Missing at Random (MAR), and even some cases of Missing Not at Random (MNAR).

Maintains Variability: Unlike simple imputation methods (mean, median), predictive imputation maintains the natural variability in the data.
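As a rough illustration of the idea behind predictive mean matching (PMM), the method used below via mice, here is a hedged base-R sketch for a single variable. This is not the mice implementation; it only shows the core logic: model the incomplete variable, then fill each gap with an observed value borrowed from the donors whose predicted values are closest.

```r
# Toy PMM sketch on synthetic data (assumed single predictor x, target y)
set.seed(1)
df <- data.frame(x = rnorm(100, 50, 10))
df$y <- 2 * df$x + rnorm(100, 0, 5)
df$y[sample(100, 20)] <- NA          # knock out 20 values

obs  <- !is.na(df$y)
fit  <- lm(y ~ x, data = df[obs, ])  # model y from complete cases
pred <- predict(fit, newdata = df)   # predicted y for every row

# For each missing y, find the 5 observed "donors" whose predicted
# values are closest, then sample one donor's OBSERVED y value
for (i in which(!obs)) {
  d       <- abs(pred[obs] - pred[i])
  donors  <- order(d)[1:5]
  df$y[i] <- df$y[obs][sample(donors, 1)]
}
```

Because imputations are drawn from actually observed values, the filled-in data keeps realistic variability, which is the property highlighted above.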

# re-assign moneyball
money <- money_ball_train
# Create an imputed data set using predictive imputation
# Compute multiple imputation (predictive mean matching, m = 5)
data1 <- mice(money, method = 'pmm', m=5)
## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
## Warning: Number of logged events: 25
completed_data <- complete(data1)

money_ball_eval <- mice(money_ball_eval, method = 'pmm', m=5)
## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_BATTING_HBP  TEAM_PITCHING_SO  TEAM_FIELDING_DP
## Warning: Number of logged events: 50
money_ball_eval <- complete(money_ball_eval)
# View missing value distribution
plot_missing(completed_data)

plot_scatterplot(money[,-c(1,11)],by="TARGET_WINS", nrow = 1L, ncol = 2L)

## Warning: Removed 102 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 903 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 102 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 286 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Re-check the distributions after imputation
par(mfrow=c(3,3)) 
for (i in 1:17) {
  hist(completed_data[,i],main=names(completed_data[i]),xlab=names(completed_data[i]),breaks = 51)
  boxplot(completed_data[,i], main=names(completed_data[i]), type="l",horizontal = TRUE)
  
  plot(completed_data[,i], completed_data$TARGET_WINS, main = names(completed_data[i]), xlab=names(completed_data[i]))
  abline(lm(completed_data$TARGET_WINS ~ completed_data[,i], data = completed_data), col = "blue")
}

Dealing with Outliers

The remove_outliers_df function removes outliers from all numeric columns in a dataset using the Interquartile Range (IQR) method. It loops through each column, checks if it’s numeric, and calculates the first (Q1) and third quartiles (Q3) to determine the IQR. It then sets upper and lower bounds as Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, and removes any rows where the values in a numeric column fall outside these bounds. Non-numeric columns are ignored, ensuring only numeric data is affected. This results in a dataset with outliers removed from all numeric columns.

remove_outliers_df <- function(df) {
  # Loop through each numeric column in the data frame
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      # Calculate Q1, Q3, and IQR for the column
      Q1 <- quantile(df[[col]], 0.25, na.rm = TRUE)
      Q3 <- quantile(df[[col]], 0.75, na.rm = TRUE)
      IQR_value <- Q3 - Q1
      
      # Set lower and upper bounds
      lower_bound <- Q1 - 1.5 * IQR_value
      upper_bound <- Q3 + 1.5 * IQR_value
      
      # Remove rows where the value in this column is an outlier
      df <- df[df[[col]] >= lower_bound & df[[col]] <= upper_bound, ]
    }
  }
  return(df)
}

# Remove the outlier in completed dataset 
money_ball <- remove_outliers_df(completed_data)
# Visualize the distribution 
par(mfrow=c(3,3)) 
for (i in 1:17) {
  hist(money_ball[,i],main=names(money_ball[i]),xlab=names(money_ball[i]),breaks = 51)
  boxplot(money_ball[,i], main=names(money_ball[i]), type="l",horizontal = TRUE)
  
  plot(money_ball[,i], money_ball$TARGET_WINS, main = names(money_ball[i]), xlab=names(money_ball[i]))
  abline(lm(money_ball$TARGET_WINS ~ money_ball[,i], data = money_ball), col = "blue")
  
}

# see how the density plot shows distribution 
melt(money_ball) %>% ggplot(aes(x= value)) +
  geom_density(fill='red') + facet_wrap(~variable, scales = 'free')
## No id variables; using all as measure variables

The density plot shows how the values of each variable are distributed across their range. Peaks in the density curve indicate where values are concentrated, while troughs represent areas with fewer observations. After removing outliers, let's see whether the correlations among the variables improved.

# Example usage
corre_updated_results <- calculate_correlations_with_pvalues(money_ball, "TARGET_WINS")

# View the results
print(corre_updated_results)
##              Predictor Correlation       PValue
## cor              INDEX  0.02716337 0.2819341234
## cor1    TEAM_BATTING_H  0.34303778 0.0000000000
## cor2   TEAM_BATTING_2B  0.19432525 0.0000000000
## cor3   TEAM_BATTING_3B  0.13103793 0.0000001867
## cor4   TEAM_BATTING_HR  0.22028779 0.0000000000
## cor5   TEAM_BATTING_BB  0.29280036 0.0000000000
## cor6   TEAM_BATTING_SO -0.06294659 0.0125804516
## cor7   TEAM_BASERUN_SB  0.16624738 0.0000000000
## cor8   TEAM_BASERUN_CS  0.03795936 0.1326082561
## cor9  TEAM_BATTING_HBP  0.01612192 0.5231229119
## cor10  TEAM_PITCHING_H  0.28084171 0.0000000000
## cor11 TEAM_PITCHING_HR  0.22399974 0.0000000000
## cor12 TEAM_PITCHING_BB  0.28313894 0.0000000000
## cor13 TEAM_PITCHING_SO -0.06898842 0.0062286091
## cor14  TEAM_FIELDING_E -0.18279144 0.0000000000
## cor15 TEAM_FIELDING_DP -0.09134064 0.0002888929

The p-values improved for most of the variables, but the correlations themselves did not improve. As noted earlier, the correlations before the transformation also showed no sign of a strong relationship with the target variable.

Model Development

We are going to build four different models and assess them based on residual analysis. The residual chart contains four diagnostic plots from a multiple linear regression analysis. These plots help assess the validity of the model by checking assumptions and identifying potential issues. Here’s what each plot represents:

1. Residuals vs Fitted (Top Left)

  • Purpose: This plot checks the linearity assumption and homoscedasticity (constant variance of residuals).
  • Interpretation: Ideally, residuals should be randomly scattered around 0, with no discernible pattern. A pattern such as a curve or a funnel shape would indicate non-linearity or heteroscedasticity. This plot shows a relatively random scatter, which suggests the linearity assumption is reasonable.

2. Normal Q-Q (Top Right)

  • Purpose: This plot tests whether the residuals are normally distributed.
  • Interpretation: If residuals are normally distributed, the points should lie approximately on the diagonal line. Deviations from this line, especially in the tails, suggest deviations from normality. In this plot, the points mostly follow the line, with some deviation at the extremes, indicating minor non-normality.

3. Scale-Location (Bottom Left)

  • Purpose: This plot, also known as a Spread-Location plot, checks for homoscedasticity (equal spread of residuals).
  • Interpretation: The residuals should display a random scatter across the range of fitted values. A funnel shape (either widening or narrowing) indicates heteroscedasticity. In this plot, the spread appears fairly constant, suggesting no major issues with homoscedasticity.

4. Residuals vs Leverage (Bottom Right)

  • Purpose: This plot helps detect influential points that might disproportionately affect the regression model.
  • Interpretation: Points with high leverage or high Cook’s distance (indicated by dashed red lines) could be problematic. In this plot, there don’t appear to be any extreme outliers with high leverage or Cook’s distance, though a few points are worth keeping an eye on (e.g., observation 1890).

Overall Analysis

  • The diagnostics suggest that the regression model mostly satisfies the assumptions of linearity, normality of residuals, and homoscedasticity, with some minor deviations. There don’t seem to be any overly influential points that require immediate attention.
# Initiate the model
model <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
              TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB,
            data = money_ball)
# Summarize the model 

summary(model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B, 
##     data = money_ball, subset = TEAM_BATTING_3B + TEAM_BATTING_HR + 
##         TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.641  -7.905   0.688   7.764  37.085 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -3.299452   6.381482  -0.517  0.60530    
## TEAM_BATTING_H   0.065262   0.005339  12.223  < 2e-16 ***
## TEAM_BATTING_2B -0.040766   0.013838  -2.946  0.00333 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.71 on 659 degrees of freedom
##   (909 observations deleted due to missingness)
## Multiple R-squared:  0.2121, Adjusted R-squared:  0.2097 
## F-statistic: 88.68 on 2 and 659 DF,  p-value: < 2.2e-16
# Model 1 evaluation
mse <- mean(model$residuals^2)
r_squared <- summary(model)$r.squared
f_stat <- summary(model)$fstatistic[1]
print(paste("MSE:", mse, "R-squared:", r_squared, "F-statistic:", f_stat))
## [1] "MSE: 136.489072082661 R-squared: 0.212070871069777 F-statistic: 88.6848187886701"
predictions <- predict(model, newdata = money_ball_eval)
head(predictions)
##        1        2        3        4        5        6 
## 68.67246 70.23015 80.28129 84.54257 82.72908 80.47014
par(mfrow=c(2,2)) 
plot(model)

The model explains 21.21% of the variance in the dependent variable, as indicated by the R-squared. The adjusted R-squared, at 20.97%, confirms that the predictors retained in the model are meaningful. The F-statistic (88.68) is large and its p-value (< 2.2e-16) is very small, suggesting that the model is statistically significant overall, meaning that at least one predictor significantly contributes to explaining the dependent variable.

# Ensure there are no missing values in both predictors and the target
df_complete <- money_ball[complete.cases(money_ball), ]


# Separate the predictors and target (assuming 'target' is the target column)
df_predictors <- df_complete[, -which(names(df_complete) == "TARGET_WINS")]
target <- df_complete$TARGET_WINS  # Target variable

# Standardize the predictors (without the target column)
df_standardized <- scale(df_predictors)
# Perform PCA
pca_model <- prcomp(df_standardized, center = TRUE, scale. = TRUE)

# Create a data frame from the principal components
df_pca <- as.data.frame(pca_model$x)

# Ensure that PCA dataframe has the same number of rows as the original data


# Add target variable to the PCA dataframe
df_pca$TARGET_WINS <- target



# Fit linear regression model using the first few principal components
model1 <- lm(TARGET_WINS ~ PC1 + PC2 + PC3 + PC4+ PC5 + PC6+ PC7 + PC8, data = df_pca)

# View the summary of the model
summary(model1)
## 
## Call:
## lm(formula = TARGET_WINS ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + 
##     PC7 + PC8, data = df_pca)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.124  -7.850   0.248   7.811  34.784 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 80.96372    0.28241 286.690  < 2e-16 ***
## PC1          0.16242    0.12452   1.304  0.19229    
## PC2          2.67804    0.16209  16.522  < 2e-16 ***
## PC3          0.55381    0.20967   2.641  0.00834 ** 
## PC4          2.64107    0.21846  12.089  < 2e-16 ***
## PC5          0.05383    0.28799   0.187  0.85175    
## PC6          0.91223    0.29534   3.089  0.00205 ** 
## PC7         -1.02033    0.32214  -3.167  0.00157 ** 
## PC8         -3.13575    0.40248  -7.791  1.2e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.19 on 1562 degrees of freedom
## Multiple R-squared:  0.2455, Adjusted R-squared:  0.2416 
## F-statistic: 63.51 on 8 and 1562 DF,  p-value: < 2.2e-16
mse1 <- mean(model1$residuals^2)
r_squared1 <- summary(model1)$r.squared
f_stat1 <- summary(model1)$fstatistic[1]
print(paste("MSE:", mse1, "R-squared:", r_squared1, "F-statistic:", f_stat1))
## [1] "MSE: 124.577029555811 R-squared: 0.245453244052163 F-statistic: 63.5146139366583"
# Project the evaluation data onto the training principal components
# (using the training centers/scales), then predict wins with model1
eval_scaled <- scale(money_ball_eval[, names(df_predictors)],
                     center = attr(df_standardized, "scaled:center"),
                     scale  = attr(df_standardized, "scaled:scale"))
eval_pca <- as.data.frame(predict(pca_model, newdata = eval_scaled))
predictions1 <- predict(model1, newdata = eval_pca)
head(predictions1)
par(mfrow=c(2,2)) 
plot(model1)

MODEL 1 Analysis

The residuals (differences between observed and predicted values) range from a minimum of -40.124 to a maximum of 34.784. The distribution of residuals, with a median close to 0 (0.248), suggests that the model has a reasonable balance of over- and under-predictions. The first quartile (1Q) is -7.850, and the third quartile (3Q) is 7.811, indicating that half of the residuals lie between these values, meaning that most predictions deviate from the actual values by around ±8 wins.

Overall, the regression model is statistically significant, but it explains only about 24.55% of the variance in the dependent variable, indicating that other variables not included in the model might be influencing the outcome. Among the predictors, PC2, PC4, and PC8 show the strongest effects, with PC3, PC6, and PC7 also significant, while PC1 and PC5 are not significant contributors. The residuals appear moderately dispersed around the predicted values, and the significant predictors provide meaningful insights into the relationships within the data.
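
The PC scores printed above highlight a pitfall of PCA regression: new data must be projected into principal-component space before the fitted lm can score it. A minimal, self-contained sketch of that two-step flow on synthetic data (all names here are hypothetical, not from the assignment dataset):

```r
# Two-step prediction flow for a PCA regression, on synthetic data
set.seed(42)
train <- as.data.frame(matrix(rnorm(200 * 4), ncol = 4))
train$y <- rowSums(train[, 1:4]) + rnorm(200, sd = 0.5)

pca      <- prcomp(train[, 1:4], center = TRUE, scale. = TRUE)
scores   <- as.data.frame(pca$x[, 1:2])
scores$y <- train$y
fit <- lm(y ~ PC1 + PC2, data = scores)

# Step 1: project new rows into PC space; step 2: predict the response from scores
new_x      <- as.data.frame(matrix(rnorm(5 * 4), ncol = 4))
new_scores <- as.data.frame(predict(pca, newdata = new_x))
preds      <- predict(fit, newdata = new_scores)
length(preds)  # 5 predicted responses, one per new row
```

Calling `predict()` on the prcomp object alone, as in the chunk above, stops after step 1.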

MODEL 3 Development

set.seed(123)

# createDataPartition() and train() come from caret
library(caret)

# An 80/20 split index is created, but the full training set is used below:
# the evaluation set has no TARGET_WINS, so cross-validation stands in for a holdout
trainIndex <- createDataPartition(money_ball$TARGET_WINS, p = 0.8, 
                                  list = FALSE)
train_data <- money_ball
test_data <- money_ball_eval



control <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation

# Train the model (linear regression in this case)
model3 <- train(TARGET_WINS ~ ., 
               data = train_data, 
               method = "lm", 
               trControl = control)

# Summary of the cross-validation results
print(model3)
## Linear Regression 
## 
## 1571 samples
##   16 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1256, 1257, 1257, 1257, 1257 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   10.14872  0.3785483  8.164485
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
print(model3$results)
##   intercept     RMSE  Rsquared      MAE    RMSESD RsquaredSD    MAESD
## 1      TRUE 10.14872 0.3785483 8.164485 0.4272108 0.05484039 0.351182
# model3 is a caret train object, so model3$residuals is NULL and the MSE below
# evaluates to NaN; the fitted residuals live in model3$finalModel$residuals
mse3 <- mean(model3$residuals^2)
r_squared3 <- summary(model3)$r.squared
f_stat3 <- summary(model3)$fstatistic[1]
print(paste("MSE:", mse3, "R-squared:", r_squared3, "F-statistic:", f_stat3))
## [1] "MSE: NaN R-squared: 0.387355522736285 F-statistic: 61.409033365965"
predictions3 <- predict(model3, newdata = money_ball_eval)
head(predictions3)
##         1         2         3         4         5         6 
##  60.45857  67.64519  72.18260  82.13139 138.20362  78.96059

This linear regression model performs reasonably well, explaining about 38% of the variance in the target variable and providing a moderate level of accuracy. While the error metrics (RMSE, MAE) indicate that the model is making reasonable predictions, the relatively low R-squared suggests that there is room for improvement, possibly by adding more predictors or using more sophisticated models that can capture non-linear patterns.
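
For reference, the RMSE and MAE that caret reports are just the root mean squared error and mean absolute error of the held-out predictions. A minimal sketch of the formulas on illustrative toy vectors (not values from the dataset):

```r
# RMSE and MAE from a vector of actuals and predictions (toy values)
actuals     <- c(70, 85, 90, 62, 78)
predictions <- c(72, 80, 88, 65, 81)

residuals <- actuals - predictions
rmse <- sqrt(mean(residuals^2))  # penalizes large misses more heavily
mae  <- mean(abs(residuals))     # average absolute miss

round(c(RMSE = rmse, MAE = mae), 3)  # RMSE 3.194, MAE 3
```

Because RMSE squares the residuals before averaging, it is always at least as large as MAE, and the gap between the two grows when a few folds or rows have unusually large errors.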

# Summary of the cross-validation results
model3$resample
##        RMSE  Rsquared      MAE Resample
## 1  9.805383 0.3973287 7.964758    Fold1
## 2 10.096924 0.4163205 7.976654    Fold2
## 3  9.698260 0.4102635 7.803420    Fold3
## 4 10.728104 0.2826990 8.600262    Fold4
## 5 10.414940 0.3861299 8.477332    Fold5
library(ggplot2)
library(reshape2) # for melt function

# Per-fold metrics from the cross-validation above
data <- model3$resample[, c("RMSE", "Rsquared", "MAE", "Resample")]

# Melt data for ggplot
data_melted <- melt(data, id.vars = 'Resample')

# Plot
ggplot(data_melted, aes(x = Resample, y = value, color = variable, group = variable)) +
  geom_line() +
  geom_point() +
  labs(y = "Metric Value", title = "Metrics Across Folds") +
  theme_minimal()

# Faceted plot
ggplot(data_melted, aes(x = Resample, y = value)) +
  geom_bar(stat = 'identity') +
  facet_wrap(~variable, scales = 'free_y') +
  labs(y = "Metric Value", title = "Metrics Across Folds") +
  theme_minimal()

The model’s performance varies slightly across folds: Fold2 has the best R-squared (0.416), Fold3 the lowest RMSE (9.70) and MAE (7.80), and Fold4 is the weakest fold, with the highest RMSE (10.73) and MAE (8.60) and the lowest R-squared (0.283). The variation in metrics suggests the model’s performance is reasonably consistent but could benefit from further tuning or improvement to ensure better generalization.

Based on the comparison of R-squared, RMSE/MSE, and F-statistic, Model 3 appears to be the best model overall. It has the higher R-squared (0.38 versus 0.25), meaning it explains more variance, and its cross-validated RMSE (10.15) beats Model 1’s in-sample RMSE (11.16). While Model 1 has a slightly higher F-statistic (63.5 versus 61.4), Model 3’s lower error and higher R-squared make it the better choice for capturing the relationship between variables.

Log Transformation

# Log transform on selected variables, excluding TEAM_BATTING_HBP
money_ball_log <- money_ball %>%
  mutate(
    log_TEAM_BATTING_H = log(TEAM_BATTING_H + 1),
    log_TEAM_BATTING_2B = log(TEAM_BATTING_2B + 1),
    log_TEAM_BATTING_3B = log(TEAM_BATTING_3B + 1),
    log_TEAM_BATTING_HR = log(TEAM_BATTING_HR + 1),
    log_TEAM_BATTING_BB = log(TEAM_BATTING_BB + 1),
    log_TEAM_PITCHING_H = log(TEAM_PITCHING_H + 1),
    log_TEAM_PITCHING_BB = log(TEAM_PITCHING_BB + 1),
    log_TEAM_FIELDING_E = log(TEAM_FIELDING_E + 1)
  )

# Fit a linear regression model using log-transformed variables
model_log <- lm(TARGET_WINS ~ log_TEAM_BATTING_H + log_TEAM_BATTING_2B + 
                log_TEAM_BATTING_3B + log_TEAM_BATTING_HR + 
                log_TEAM_BATTING_BB + TEAM_BATTING_SO + 
                TEAM_BASERUN_SB + TEAM_BASERUN_CS + 
                log_TEAM_PITCHING_H + log_TEAM_PITCHING_BB + TEAM_PITCHING_HR + 
                TEAM_PITCHING_SO + log_TEAM_FIELDING_E + TEAM_FIELDING_DP,
                data = money_ball_log)

# Model summary
# summary(model_log)
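
The `+ 1` offset guards against `log(0) = -Inf` for counts that can be zero (a team can go a season without triples, for example). Base R’s `log1p()` computes the same quantity with better numerical behavior near zero; a small self-contained check:

```r
# log(x + 1) vs log1p(x): the same transform, but log1p is more accurate
# for very small x and makes the zero-count intent explicit
x <- c(0, 1, 10, 100)
all.equal(log(x + 1), log1p(x))  # TRUE
log1p(0)                         # 0: a zero count maps to zero, not -Inf
```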

Square Root Transformation

# Square root transform on selected variables, excluding TEAM_BATTING_HBP
money_ball_sqrt <- money_ball %>%
  mutate(
    sqrt_TEAM_BATTING_H = sqrt(TEAM_BATTING_H),
    sqrt_TEAM_BATTING_2B = sqrt(TEAM_BATTING_2B),
    sqrt_TEAM_BATTING_3B = sqrt(TEAM_BATTING_3B),
    sqrt_TEAM_BATTING_HR = sqrt(TEAM_BATTING_HR),
    sqrt_TEAM_BATTING_BB = sqrt(TEAM_BATTING_BB),
    sqrt_TEAM_PITCHING_H = sqrt(TEAM_PITCHING_H),
    sqrt_TEAM_PITCHING_BB = sqrt(TEAM_PITCHING_BB),
    sqrt_TEAM_FIELDING_E = sqrt(TEAM_FIELDING_E)
  )

# Fit a linear regression model using square root-transformed variables
model_sqrt <- lm(TARGET_WINS ~ sqrt_TEAM_BATTING_H + sqrt_TEAM_BATTING_2B + 
                 sqrt_TEAM_BATTING_3B + sqrt_TEAM_BATTING_HR + 
                 sqrt_TEAM_BATTING_BB + TEAM_BATTING_SO + 
                 TEAM_BASERUN_SB + TEAM_BASERUN_CS + 
                 sqrt_TEAM_PITCHING_H + sqrt_TEAM_PITCHING_BB + TEAM_PITCHING_HR + 
                 TEAM_PITCHING_SO + sqrt_TEAM_FIELDING_E + TEAM_FIELDING_DP,
                 data = money_ball_sqrt)

# Model summary
# summary(model_sqrt)

Interaction Terms

# Interaction terms between batting and pitching stats
money_ball_interaction_outlier <- money_ball %>%
  mutate(
    interaction_HR_BB = TEAM_BATTING_HR * TEAM_BATTING_BB,
    interaction_Pitching_HR_BB = TEAM_PITCHING_HR * TEAM_PITCHING_BB,
    interaction_Fielding_E_DP = TEAM_FIELDING_E * TEAM_FIELDING_DP
  )

# Fit a linear regression model with interaction terms
model_interaction_outlier <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
                        TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
                        TEAM_BASERUN_SB + TEAM_BASERUN_CS +
                        TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
                        TEAM_FIELDING_E + TEAM_FIELDING_DP +
                        interaction_HR_BB + interaction_Pitching_HR_BB + interaction_Fielding_E_DP,
                      data = money_ball_interaction_outlier)

# Model summary
# summary(model_interaction_outlier)
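
As an aside, R’s formula interface can express interactions directly, without precomputing product columns in `mutate()`: `a:b` adds only the product term, while `a*b` expands to the main effects plus the interaction. A self-contained sketch on synthetic data (hypothetical names):

```r
# Interaction terms via the formula interface rather than manual columns
set.seed(7)
d <- data.frame(a = rnorm(50), b = rnorm(50))
d$y <- d$a + d$b + 0.5 * d$a * d$b + rnorm(50, sd = 0.1)

fit <- lm(y ~ a * b, data = d)  # a * b expands to a + b + a:b
names(coef(fit))                # "(Intercept)" "a" "b" "a:b"
```

Either route fits the same model; the formula version keeps the data frame unchanged and makes the interaction visible in the model call.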

Removing Outliers

# Function to remove outliers based on IQR method
remove_outliers <- function(df, cols) {
  for (col in cols) {
    Q1 <- quantile(df[[col]], 0.25, na.rm = TRUE)
    Q3 <- quantile(df[[col]], 0.75, na.rm = TRUE)
    IQR_value <- Q3 - Q1
    lower_bound <- Q1 - 1.5 * IQR_value
    upper_bound <- Q3 + 1.5 * IQR_value
    df <- df[df[[col]] >= lower_bound & df[[col]] <= upper_bound, ]
  }
  return(df)
}
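
To see the 1.5 × IQR rule in action before applying it to the real columns, here is a toy vector with one obvious outlier (values illustrative only):

```r
# Toy check of the 1.5 * IQR rule used by remove_outliers()
x  <- c(10, 11, 12, 13, 14, 100)
Q1 <- quantile(x, 0.25)        # 11.25
Q3 <- quantile(x, 0.75)        # 13.75
lower <- Q1 - 1.5 * (Q3 - Q1)  # 7.5
upper <- Q3 + 1.5 * (Q3 - Q1)  # 17.5
x[x >= lower & x <= upper]     # 10 11 12 13 14 -- the 100 is dropped
```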

# Remove outliers from relevant columns, excluding TEAM_BATTING_HBP
money_ball_no_outliers <- remove_outliers(money_ball, 
  c("TEAM_BATTING_H", "TEAM_BATTING_HR", "TEAM_BATTING_BB", 
    "TEAM_PITCHING_H", "TEAM_PITCHING_HR", "TEAM_PITCHING_BB", 
    "TEAM_FIELDING_E"))

# Fit a linear regression model without outliers
model_no_outliers <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
                          TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
                          TEAM_BASERUN_SB + TEAM_BASERUN_CS + 
                          TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
                          TEAM_FIELDING_E + TEAM_FIELDING_DP,
                        data = money_ball_no_outliers)

# Model summary
# summary(model_no_outliers)

Outlier Removal and Interaction Terms Model

# Remove outliers based on IQR for key variables
money_ball_clean <- money_ball %>%
  filter(
    between(TEAM_BATTING_HR, quantile(TEAM_BATTING_HR, 0.25) - 1.5*IQR(TEAM_BATTING_HR), quantile(TEAM_BATTING_HR, 0.75) + 1.5*IQR(TEAM_BATTING_HR)),
    between(TEAM_PITCHING_BB, quantile(TEAM_PITCHING_BB, 0.25) - 1.5*IQR(TEAM_PITCHING_BB), quantile(TEAM_PITCHING_BB, 0.75) + 1.5*IQR(TEAM_PITCHING_BB))
  )

# Create interaction terms for batting and pitching stats
money_ball_interaction_no_outlier <- money_ball_clean %>%
  mutate(
    interaction_BATTING = TEAM_BATTING_HR * TEAM_BATTING_BB,
    interaction_PITCHING = TEAM_PITCHING_H * TEAM_PITCHING_BB
  )

# Fit a linear regression model with interaction terms
model_interaction_no_outlier <- lm(TARGET_WINS ~ interaction_BATTING + interaction_PITCHING +
                        TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
                        TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + 
                        TEAM_PITCHING_HR + TEAM_PITCHING_SO + 
                        TEAM_FIELDING_DP + TEAM_FIELDING_E,
                        data = money_ball_interaction_no_outlier)

# Model summary
# summary(model_interaction_no_outlier)

Model Evaluation Function

## Model Evaluation Function
evaluate_model <- function(model, data) {
  # Predicted values
  predictions <- predict(model, data)
  
  # Actual values (target variable)
  actuals <- data$TARGET_WINS
  
  # Calculate residuals
  residuals <- actuals - predictions
  
  # Calculate metrics (note: R-squared, adjusted R-squared, and the F-statistic
  # come from summary(model) and are in-sample regardless of the data argument;
  # only MSE/RMSE reflect the data passed in)
  r_squared <- summary(model)$r.squared
  adj_r_squared <- summary(model)$adj.r.squared
  mse <- mean(residuals^2)
  rmse <- sqrt(mse)
  f_statistic <- summary(model)$fstatistic[1]
  
  # Print evaluation metrics
  cat("Model Evaluation Metrics:\n")
  cat("R-squared: ", r_squared, "\n")
  cat("Adjusted R-squared: ", adj_r_squared, "\n")
  cat("MSE: ", mse, "\n")
  cat("RMSE: ", rmse, "\n")
  cat("F-statistic: ", f_statistic, "\n")
  
  # Generate diagnostic plots for residual analysis
  par(mfrow = c(2, 2))
  plot(model)
  
  # Return the evaluation metrics in a list
  return(list(
    R_squared = r_squared,
    Adjusted_R_squared = adj_r_squared,
    MSE = mse,
    RMSE = rmse,
    F_statistic = f_statistic
  ))
}
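
One caveat about this helper: the R-squared and F-statistic it reports come from `summary(model)` and are therefore always in-sample, even if `data` is a holdout set, while MSE/RMSE do reflect whatever data is passed in. A self-contained illustration on synthetic data (hypothetical names):

```r
# In-sample R-squared is fixed at fit time; MSE depends on the scoring data
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)
fit <- lm(y ~ x, data = d)

holdout <- data.frame(x = rnorm(50))
holdout$y <- 2 * holdout$x + rnorm(50)

r2_in       <- summary(fit)$r.squared                     # same number either way
mse_train   <- mean((d$y - predict(fit, d))^2)            # training-set MSE
mse_holdout <- mean((holdout$y - predict(fit, holdout))^2)  # holdout MSE
```

Since every call below passes the same data the model was fit on, the helper’s metrics are consistent here, but the distinction matters if it is ever reused with a true test set.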

Log Transformation Model Evaluation

evaluation_log <- evaluate_model(model_log, money_ball_log)
## Model Evaluation Metrics:
## R-squared:  0.3728454 
## Adjusted R-squared:  0.3672026 
## MSE:  103.5444 
## RMSE:  10.17567 
## F-statistic:  66.07477

Square Root Transformation Model Evaluation

evaluation_sqrt <- evaluate_model(model_sqrt, money_ball_sqrt)
## Model Evaluation Metrics:
## R-squared:  0.3819698 
## Adjusted R-squared:  0.3764091 
## MSE:  102.0379 
## RMSE:  10.10138 
## F-statistic:  68.69117

Interaction Terms Model Evaluation

evaluation_Interaction_outlier <- evaluate_model(model_interaction_outlier, money_ball_interaction_outlier)
## Model Evaluation Metrics:
## R-squared:  0.3924433 
## Adjusted R-squared:  0.3857927 
## MSE:  100.3087 
## RMSE:  10.01542 
## F-statistic:  59.00825

Removing Outliers Model Evaluation

evaluation_no_outliers <- evaluate_model(model_no_outliers, money_ball_no_outliers)
## Model Evaluation Metrics:
## R-squared:  0.3857453 
## Adjusted R-squared:  0.3800084 
## MSE:  99.49619 
## RMSE:  9.974777 
## F-statistic:  67.23969

Outlier Removal and Interaction Terms Model Evaluation

evaluation_interaction_no_outlier <- evaluate_model(model_interaction_no_outlier, money_ball_interaction_no_outlier)
## Model Evaluation Metrics:
## R-squared:  0.3843853 
## Adjusted R-squared:  0.3796316 
## MSE:  101.6566 
## RMSE:  10.08249 
## F-statistic:  80.85886

Are you keeping the model even though it is counterintuitive? Why? The boss needs to know.

Based on the models we have analyzed, the following variables have coefficients that make sense:

  • TEAM_BATTING_H (Hits): Positive coefficients across models indicate that more hits lead to more wins, which is expected because hits create scoring opportunities.

  • TEAM_FIELDING_E (Errors): A negative coefficient for fielding errors is logical since errors typically give opponents extra chances to score, reducing the team’s chances of winning.

  • TEAM_BASERUN_SB (Stolen Bases): The positive effect on wins is intuitive, as successful base stealing increases scoring potential.

Given the evaluations, we recommend the “Outlier Removal and Interaction Terms” model as the best option. This model has an R-squared of 0.3844 and an Adjusted R-squared of 0.3796, indicating a good level of variance explanation. While its Mean Squared Error (MSE) is 101.66 and Root Mean Squared Error (RMSE) is 10.08, the model demonstrates strong predictive capability. Notably, it has the highest F-statistic of 80.86, suggesting it explains the variance in the outcome variable effectively. Although the “Interaction Terms” model has the highest R-squared (0.3924) and Adjusted R-squared (0.3858), its additional complexity may lead to overfitting and makes it harder to interpret. Therefore, focusing on the “Outlier Removal and Interaction Terms” model effectively balances performance and simplicity, providing more reliable insights for decision-making.

Select Models:

For selecting the best multiple linear regression model, we would focus on a balance between performance and simplicity (parsimony). Let’s analyze the “Outlier Removal and Interaction Terms” model and the “Interaction Terms” model based on each criterion:

  • Performance vs. Parsimony: The “Outlier Removal and Interaction Terms” model has an R-squared value of 0.3844 and an Adjusted R-squared of 0.3796, indicating it explains a meaningful portion of the variance. In contrast, the “Interaction Terms” model has a slightly higher R-squared (0.3924), but its additional predictors and interaction terms complicate interpretation.

  • Mean Squared Error (MSE): The “Outlier Removal and Interaction Terms” model has a slightly higher MSE (101.66) compared to the “Interaction Terms” model (100.31), but it demonstrates robust predictive capability despite this small difference.

  • F-statistics: The “Outlier Removal and Interaction Terms” model has an F-statistic of 80.86, which is higher than the “Interaction Terms” model’s F-statistic of 59.01. This suggests that the “Outlier Removal and Interaction Terms” model explains the variance in the outcome variable more effectively relative to its number of predictors.

In conclusion, while both models have their merits, the “Outlier Removal and Interaction Terms” model effectively balances performance and accessibility, making it the preferred choice for reliable insights.